The goal of our project is to determine whether a county’s suicide rate correlates with various geographic, environmental, political, and socioeconomic indicators. Through our analysis, we hope to promote greater awareness of mental health and especially suicide. As students, we have noticed the consistently rising suicide rates and prevalence of mental health issues among high schoolers, both of which have been accentuated by the COVID-19 pandemic. In fact, the city of Palo Alto, which neighbors Stanford University and is home to numerous nationally ranked high schools, sees a teen suicide rate that is more than four times the national average (Chawla). Just this past May, a student in the Mountain View–Los Altos Union High School District took their own life, marking the third consecutive year a student has done so. Ultimately, we want to provide better insight to local and state governments along with non-profit organizations so that they can properly allocate mental health resources. We collected data from several credible sources such as the Centers for Disease Control and Prevention (CDC), the United States Department of Agriculture (USDA), and the United States Environmental Protection Agency (EPA). After the data cleaning process, we were left with 459 of the original 991 counties from the CDC dataset, largely because the CDC marks rates as “unreliable” for counties whose death count is less than 20. In our analysis, we used methods such as decision trees, random forests, and LASSO to determine which variables are significant in predicting suicide rates. Finally, we evaluated and compared the models we created to decide which performed best.
The data on suicide rates was collected from the Centers for Disease Control and Prevention (CDC) and covers the year 2019. It should be noted that the crude and age-adjusted rates for 451 entries in this dataset were deemed “unreliable” by the CDC, meaning that the death count for each of those counties was less than 20. For those counties, we chose to replace “unreliable” with NA.
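
The replacement of flagged rates with a missing value can be sketched as follows. This is a minimal Python illustration, not the code used in the project; the function name `parse_rate`, the exact spelling of the flag, and the sample values are assumptions made for the example.

```python
from typing import Optional


def parse_rate(raw: str) -> Optional[float]:
    """Convert a rate field to a float, or None when it is flagged or empty."""
    text = raw.strip()
    # Assumption: flagged entries contain the word "Unreliable" in the rate field.
    if not text or "unreliable" in text.lower():
        return None
    return float(text)


# Hypothetical values, not taken from the real dataset.
raw_rates = ["14.2", "Unreliable", "21.7 ", ""]
cleaned_rates = [parse_rate(r) for r in raw_rates]
print(cleaned_rates)  # [14.2, None, 21.7, None]
```

Using `None` plays the role of NA here, so that later steps can drop incomplete rows in one pass.
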
The data on the average temperature and precipitation for 2019 was taken from the National Oceanic and Atmospheric Administration (NOAA). Because the Location.ID is different from the FIPS code (it contains the state abbreviation), we removed the numbers following the state abbreviation and named the resulting column “State.”
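
As a rough sketch of this step, the state abbreviation can be pulled from the front of the identifier. The example assumes identifiers shaped like `CA-085` (letters first, then a separator and digits); that format and the helper name `state_from_location_id` are assumptions for illustration.

```python
import re


def state_from_location_id(location_id: str) -> str:
    """Return the leading letters of a location ID as the state abbreviation."""
    # Assumption: the ID starts with the state abbreviation, e.g. "CA-085".
    match = re.match(r"[A-Za-z]+", location_id.strip())
    if match is None:
        raise ValueError(f"No state abbreviation found in {location_id!r}")
    return match.group(0).upper()


print(state_from_location_id("CA-085"))  # CA
```
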
The data on the Air Quality Index (AQI) was taken from the United States Environmental Protection Agency (EPA). In order to merge this data with the CDC data, we had to add the word “county” (or “parish” for Louisiana, along with a few other special cases) to the end of each county name. We also had to convert each state name to its abbreviation.
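
A minimal sketch of this name normalization is shown below. The lookup table is only an excerpt, the helper name `normalize_county` is hypothetical, and the other special cases mentioned above are not handled.

```python
from typing import Tuple

# Excerpt only; a full table would cover every state.
STATE_ABBREVIATIONS = {"California": "CA", "Louisiana": "LA", "Oregon": "OR"}


def normalize_county(county: str, state_name: str) -> Tuple[str, str]:
    """Return (county name with suffix, state abbreviation)."""
    abbreviation = STATE_ABBREVIATIONS[state_name]
    # Louisiana uses "Parish"; other special cases are out of scope here.
    suffix = "Parish" if abbreviation == "LA" else "County"
    name = county.strip()
    if not name.lower().endswith(suffix.lower()):
        name = f"{name} {suffix}"
    return name, abbreviation


print(normalize_county("Santa Clara", "California"))
print(normalize_county("Orleans", "Louisiana"))
```
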
Our socioeconomic data was taken from the Economic Research Service under the US Department of Agriculture (ERS USDA). We simply chose the variables we believed could influence the suicide rate, such as information about income, jobs, education, etc.
No cleaning needed to be done for the political data. We gathered this dataset from the 2020 Presidential Election Results by county.
After cleaning each dataset individually, we then merged all of them together using the state, county, or FIPS column(s) when necessary. For the datasets that did not include a FIPS code, we had to use the state and county columns in combination during the merging since several states had counties with the same name. Afterwards, we chose to remove all rows containing NA. As a result, we kept 459 out of the 991 original counties from the CDC data.
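
The need for a combined key can be illustrated with a small Python sketch. It assumes an inner join (rows without a match are dropped) and uses made-up rows; the function names and values are hypothetical and only the column names echo the text.

```python
from typing import Dict, List


def merge_on_state_county(left: List[Dict], right: List[Dict]) -> List[Dict]:
    """Inner-join two lists of rows on the (State, County) pair."""
    right_index = {(row["State"], row["County"]): row for row in right}
    merged = []
    for row in left:
        match = right_index.get((row["State"], row["County"]))
        if match is not None:
            merged.append({**match, **row})
    return merged


def drop_incomplete(rows: List[Dict]) -> List[Dict]:
    """Keep only rows in which no value is missing (None stands in for NA)."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical rows: the same county name appears in two states.
cdc = [
    {"State": "OR", "County": "Washington County", "Rate": 14.0},
    {"State": "PA", "County": "Washington County", "Rate": None},
]
aqi = [
    {"State": "OR", "County": "Washington County", "Median.AQI": 30},
    {"State": "PA", "County": "Washington County", "Median.AQI": 45},
]
merged = merge_on_state_county(cdc, aqi)
complete = drop_incomplete(merged)
print(len(merged), len(complete))  # 2 1
```

Joining on the county name alone would pair both "Washington County" rows with a single AQI value, which is why the state is part of the key.
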
It should be noted that since we all worked separately on different parts of this section, in our code, the merged_data variable changes between the EDAs.

## Suicide Rate Data
We begin our exploratory data analysis by delving into our response variable: the age adjusted suicide rate by county. It is important that we use the age adjusted rate because the age profiles of different counties vary. As a result, we must take age into account in order to make unbiased comparisons across different counties.
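
To see why age adjustment matters, consider a small sketch of direct standardization, in which age-specific rates are weighted by a fixed standard population. All numbers, the two age groups, and the equal weights are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Sequence


def crude_rate(deaths: Sequence[float], population: Sequence[float]) -> float:
    """Deaths per 100,000 people, ignoring age structure."""
    return 100_000 * sum(deaths) / sum(population)


def age_adjusted_rate(
    deaths: Sequence[float],
    population: Sequence[float],
    standard_weights: Sequence[float],
) -> float:
    """Weighted sum of age-specific rates per 100,000 (direct standardization)."""
    return sum(
        w * 100_000 * d / p
        for d, p, w in zip(deaths, population, standard_weights)
    )


# Two hypothetical counties with identical age-specific rates
# (10 and 30 per 100,000) but opposite age mixes: [younger, older].
weights = [0.5, 0.5]
county_a = {"deaths": [8, 6], "population": [80_000, 20_000]}
county_b = {"deaths": [2, 24], "population": [20_000, 80_000]}

for county in (county_a, county_b):
    print(
        crude_rate(county["deaths"], county["population"]),
        age_adjusted_rate(county["deaths"], county["population"], weights),
    )
```

The crude rates differ (14 versus 26 per 100,000) purely because of the age mix, while the adjusted rate is 20 for both counties.
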
As you can see from the histogram of the age adjusted rate below, the distribution is quite right skewed. Furthermore, it seems to be centered around 15.
Next, we create a county level heatmap displaying the age adjusted suicide rate. We can see that, in general, counties cluster in groups with similar age adjusted suicide rates. For example, much of Southern California seems to have an age adjusted suicide rate between 10 and 20. On the other hand, counties in southwestern Oregon seem to all have a higher age adjusted suicide rate of around 35.
The average temperature heatmap shows only the counties where the suicide rate was collected. All temperatures are shown in degrees Fahrenheit. As expected, counties in the South had the highest average temperatures in 2019, and counties in the North had the lowest.
The heatmap shown above depicts the total precipitation in inches from the year 2019 in each county. It seems like counties in the East, particularly the Southeast, received the most precipitation, and counties in the Southwest received the least. This could be explained by prior knowledge of the Southwest’s arid environment.
We decided to graph the AQI variables because we thought that air quality could impact the suicide rate. The variables we chose were the median AQI and the percentage of unhealthy days. The scatter plot of Age_Adjusted_rate versus Median.AQI shows no clear trend between the two variables. The points show no discernible pattern, suggesting that the median AQI is not strongly associated with the suicide rate. Similarly, the scatter plot of Age_Adjusted_rate versus per_unhealthy_days shows no correlation between the two variables.
We then turn our attention to our socioeconomic variables. Two variables of interest are education and poverty.
For education, we will take a look at the percentage of adults over 25 whose education level is only some college. From the scatter plot of that variable vs the age adjusted suicide rate, a moderate, positive, and linear relationship can be seen.
Next, we look at the percentage of all people in the county living in poverty. From the scatter plot of poverty vs the age adjusted suicide rate, a weak, positive, and linear relationship can be seen between the two variables.
The heatmap below shows which party each county voted for in the 2020 Presidential Election. Counties shown in red voted for the Republican Party, and counties shown in blue voted for the Democratic Party.
We first read in the merged data and remove the X, State, County, and FIPS columns. Next, we drop all rows with NA values and split the dataset in the following manner: 75% for training and 25% for testing.
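
A 75/25 split of this kind can be sketched in a few lines of Python. The helper name `train_test_split`, the seed, and the use of integers as stand-ins for county rows are assumptions for illustration, not the project's actual code.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split(
    rows: Sequence, train_fraction: float = 0.75, seed: int = 1
) -> Tuple[List, List]:
    """Shuffle rows reproducibly and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(round(train_fraction * len(shuffled)))
    return shuffled[:cutoff], shuffled[cutoff:]


rows = list(range(459))  # stand-ins for the 459 counties kept after cleaning
train, test = train_test_split(rows)
print(len(train), len(test))  # 344 115
```

Fixing the seed keeps the split identical across runs, so that every model is evaluated on the same testing set.
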
The first model we create is a single decision tree using all available predictors. According to the tree, the variables that are significant in predicting a county’s age adjusted suicide rate are: Population, Total_Precipitation, Net_International_Migration_Rate_2010_2019, Ed3SomeCollegePct, HispanicPct2010, BlackNonHispanicPct2010, AvgHHSize, PctEmpMining, and Avg_Temperature.
From the leftmost end node, we can see that, according to the decision tree, counties with small populations (less than 80612) tend to see the highest age adjusted suicide rates. This seems to indicate that rural communities may be at a greater risk of suicide. This may be because rural communities face many barriers to accessing quality healthcare and tend to have less education on mental health, which can lead to stigma.
Next, we take the variables determined to be significant by the decision tree and create a linear model from them. According to the linear model, Net_International_Migration_Rate_2010_2019 and HispanicPct2010 are not significant at a 10% alpha level.
Furthermore, when controlling for all other variables, a change in AvgHHSize has the largest impact on a county’s age adjusted suicide rate. That is, when AvgHHSize increases by one unit, a county’s age adjusted suicide rate seems to decrease by 7.87. This may be due to the fact that larger households are better equipped to offer social and emotional support to each household member, decreasing the risk of suicide.
The third model we create is a random forest. We set mtry to 11 because we have 34 independent variables, and roughly one third of the predictors is a common choice for regression forests. We set ntree to 500 because, as you can see in the graph below, the error seems to settle down around there.
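
The arithmetic behind the choice of 11 is simply one third of the predictors, rounded down. The sketch below writes that rule out; the function name is hypothetical, and the rule is shown as a common heuristic for regression forests rather than as the project's own code.

```python
def default_mtry_for_regression(n_predictors: int) -> int:
    """One third of the predictors, rounded down, with a minimum of 1."""
    return max(n_predictors // 3, 1)


mtry = default_mtry_for_regression(34)
print(mtry)  # 11
```
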
We used LASSO to create a regression model because its penalty shrinks some coefficients to exactly zero, which reduces the number of variables in the equation and creates a tighter model. LASSO removed 12 variables, leaving a total of 22.
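
How LASSO drops variables can be illustrated with the soft-thresholding rule. In the simplified case of orthonormal predictors, each LASSO coefficient equals the least-squares coefficient pulled toward zero by the penalty, and set to exactly zero if it is smaller than the penalty. The sketch below assumes that simplified case; the coefficient values and the penalty are made up.

```python
from typing import List


def soft_threshold(value: float, penalty: float) -> float:
    """Shrink a coefficient toward zero; return 0.0 if it is within the penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical least-squares coefficients and penalty.
ols_coefficients = [2.5, -0.25, 0.75, -1.75]
penalty = 1.0
lasso_coefficients: List[float] = [
    soft_threshold(c, penalty) for c in ols_coefficients
]
print(lasso_coefficients)  # [1.5, 0.0, 0.0, -0.75]
```

The two small coefficients become exactly zero, which is the sense in which LASSO removes variables from the model.
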
Afterwards, we refit a linear model on the variables LASSO selected and applied backward selection (an approach we refer to as relaxed LASSO) to confirm that our model only included variables that were significant at the 0.05 level. During this process, we removed another 11 variables, reducing our model to 11 variables.
Our model appears to satisfy the linear model assumptions. The residual plot shows a relatively random pattern, with points scattered evenly above and below the y = 0 line, which supports linearity and homoscedasticity. The points on the Q-Q plot fall roughly along a straight line, which is consistent with normality of the residuals.
Using the LASSO equation, we can analyze the effect of an explanatory variable on the suicide rate while controlling for the other variables. For example, for every 0.6391 increase in the percentage of Native American Non-Hispanic people in the county, the suicide rate increases by one death per 100,000.
Now that we have created all four models, it is time to evaluate and compare them. After running the models on the testing dataset and extracting their mean squared errors, we can see that the random forest performed the best, with a mean squared error of only 15.4. On the other hand, the linear model based on our tree’s significant variables performed the worst, with a testing error of 25.8. LASSO and the single tree did not perform much better, with testing errors of 25.2 and 24.8, respectively. Although the LASSO model did not perform that well, we still think it is important because it provides a clear and interpretable linear relationship between the variables.
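
For reference, the mean squared error used in this comparison is the average of the squared differences between actual and predicted rates. The sketch below computes it on made-up values and then ranks the four testing errors reported above; the function name and the toy data are assumptions for illustration.

```python
from typing import Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average squared difference between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("Inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy example with hypothetical rates: (4 + 4 + 16) / 3
example_mse = mean_squared_error([10, 20, 30], [12, 18, 34])
print(example_mse)  # 8.0

# Testing errors as reported in the text.
testing_errors = {
    "random forest": 15.4,
    "single tree": 24.8,
    "LASSO": 25.2,
    "linear model (tree variables)": 25.8,
}
ranking = sorted(testing_errors, key=testing_errors.get)
print(ranking[0])  # random forest
```
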
In summary, our project revealed many variables that correlate with the age-adjusted suicide rate across US counties, some of which could be predictors of an increase or decrease in suicide rate. We hope that our research will provide more information about what factors influence suicide rate and be an indication to leaders, both local and national, about what could be changed to decrease suicide rate.
First, socioeconomic factors seem to have a strong correlation with suicide rates. Many of our significant variables from our Tree and relaxed LASSO models are from our socioeconomic data. This means that when leaders develop plans to combat suicide rate, consideration of socioeconomic factors should be prioritized.
Second, correlation does not imply causation. The influence of some variables, such as Avg_Temperature, on county-level suicide rates may be difficult to pinpoint. Therefore, it is important to mention that the variables we found to correlate with the suicide rate will not necessarily have a direct impact on it. Those variables could instead be correlated with another variable that does directly impact the suicide rate. Furthermore, our Single Tree and Random Forest models do not directly show whether a significant variable impacts the suicide rate in a positive or negative way.
Finally, more mental health funding and support should be allocated for Native and Indigenous communities and areas with education barriers. This is due to the fact that some of our significant variables included the percentage of people in minority communities, particularly Native and Indigenous people, and people with relatively lower education levels.
Chawla, Ishika. “CDC Releases Preliminary Findings on Palo Alto Suicide Clusters.” The Stanford Daily, 21 July 2016, www.stanforddaily.com/2016/07/21/cdc-releases-preliminary-findings-on-palo-alto-suicide-clusters/.